Global Alignment of Molecular Sequences via Ancestral State Reconstruction
Abstract
Molecular phylogenetic techniques do not generally account for such common evolutionary events as site insertions and deletions (known as indels). Instead, tree-building algorithms and ancestral state inference procedures typically rely on substitution-only models of sequence evolution. In practice, these methods are extended beyond this simplified setting through heuristics that produce global alignments of the input sequences, an important problem that has no rigorous model-based solution. In this paper we open a new direction on this topic by considering a version of the multiple sequence alignment problem in the context of stochastic indel models. More precisely, we introduce the following trace reconstruction problem on a tree (TRPT): a binary sequence is broadcast through a tree channel where we allow substitutions, deletions, and insertions; we seek to reconstruct the original sequence from the sequences received at the leaves of the tree. We give a recursive procedure for this problem with strong reconstruction guarantees at low mutation rates, which also provides an alignment of the sequences at the leaves of the tree. The TRPT problem without indels has been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a bootstrapping step towards obtaining information-theoretically optimal phylogenetic reconstruction methods. The present work sets up a framework for extending these works to evolutionary models with indels.
In the TRPT problem we begin with a random sequence $x_1, \ldots, x_k$ at the root of a $d$-ary tree. If vertex $v$ has the sequence $y_1, \ldots, y_{k_v}$, then each of its $d$ children has a sequence generated from $y_1, \ldots, y_{k_v}$ by flipping three biased coins for each bit. The first coin has probability $p_s$ of Heads and determines whether the bit is substituted. The second coin has probability $p_d$ and determines whether the bit is deleted. The third coin has probability $p_i$ and determines whether a new random bit is inserted. The input to the procedure consists of the sequences at the $n$ leaves of the tree together with the tree structure (but not the sequences at the inner vertices), and the goal is to reconstruct an approximation to the sequence at the root (the DNA of the ancestral father). For every $\chi > 0$ we present an algorithm which outputs, with probability $1 - \chi$, an approximation of $x_1, \ldots, x_k$ if $p_i + p_d < O(1/k^{2/3} \log n)$ and $(1 - 2p_s) > C d^{-1} \log d$ for some constant $C > 0$ and every sufficiently large $d$. To our knowledge, this is the first rigorous trace reconstruction result on a tree in the presence of indels.
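The channel described in the abstract can be illustrated with a short simulation. The following Python sketch is not the authors' reconstruction procedure; it only generates leaf sequences of the kind the procedure takes as input. The function names `mutate` and `broadcast` are hypothetical, the tree is assumed to be complete, and two details the abstract leaves open are filled in by assumption: an inserted bit is placed immediately after the current position, and the insertion coin is flipped even when the bit is deleted.

```python
import random


def mutate(seq, ps, pd, pi, rng):
    """Pass one binary sequence through a single edge of the tree channel."""
    child = []
    for bit in seq:
        substituted = rng.random() < ps  # first coin: substitution
        deleted = rng.random() < pd      # second coin: deletion
        inserted = rng.random() < pi     # third coin: insertion
        if not deleted:
            child.append(bit ^ 1 if substituted else bit)
        if inserted:
            # Assumption: the new uniformly random bit goes right after
            # the current position, whether or not the bit was deleted.
            child.append(rng.randrange(2))
    return child


def broadcast(root, d, depth, ps, pd, pi, rng):
    """Return the d**depth leaf sequences of a complete d-ary tree."""
    level = [root]
    for _ in range(depth):
        # Each vertex independently produces d mutated children.
        level = [mutate(seq, ps, pd, pi, rng)
                 for seq in level
                 for _child in range(d)]
    return level
```

For example, calling `broadcast(root, d=3, depth=2, ps=0.05, pd=0.01, pi=0.01, rng=random.Random(0))` on a list of 0/1 values yields nine leaf sequences whose lengths generally differ from that of the root, which is why an alignment is needed before the leaves can be compared position by position.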
Similar resources
Historian: accurate reconstruction of ancestral sequences and evolutionary rates
Motivation: Reconstruction of ancestral sequence histories, and estimation of parameters like indel rates, are improved by using explicit evolutionary models and summing over uncertain alignments. The previous best tool for this purpose (according to simulation benchmarks) was ProtPal, but this tool was too slow for practical use. Results: Historian combines an efficient reimplementation of the...
Robustness of Ancestral Sequence Reconstruction to Phylogenetic Uncertainty
Ancestral sequence reconstruction (ASR) is widely used to formulate and test hypotheses about the sequences, functions, and structures of ancient genes. Ancestral sequences are usually inferred from an alignment of extant sequences using a maximum likelihood (ML) phylogenetic algorithm, which calculates the most likely ancestral sequence assuming a probabilistic model of sequence evolution and ...
Computational reconstruction of ancestral DNA sequences.
This chapter introduces the problem of ancestral sequence reconstruction: given a set of extant orthologous DNA genomic sequences (or even whole-genomes), together with a phylogenetic tree relating these sequences, predict the DNA sequence of all ancestral species in the tree. Blanchette et al. (1) have shown that for certain sets of species (in particular, for eutherian mammals), very accurate...
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...
Exact and Heuristic Algorithms for the Indel Maximum Likelihood Problem
Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing the most likely scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, that we called the Indel Maximum Likelihood Problem (IMLP), is an important step toward the reconstruction of ancestral genomi...